--- title: "Exploratory Data Analysis: OpenBeta Climbing Data (2022)" author: "Nina" date: '2022-05-30' #slug: EDA-open-beta categories: ["R", "Visualizations"] tags: ["R Markdown", "plot", "regression"] ---
TEST
The population of climbers is exploding and we need more and better access to data to make the sport more accessible. I am using data provided from OpenBeta, a nonprofit built and run by climbers that enables “open access and innovative uses of climbing data” (1). I plan to build a Shiny app that will include a map of sport and trad climbing routes on the West Coast ranked by route quality and a recommendation engine tailored to the user of any skill level.
Following the sport of climbing’s Olympic debut in Tokyo 2021 and the success of films like Free Solo featuring Alex Honnold (2018) and The Dawn Wall featuring Tommy Caldwell (2018), the industry is seeing historic growth and opportunities for new, profitable markets. According to Forbes, Google searches that included the term “climbing” reached an all time high in the first week of August 2021; the same time frame that men’s and women’s combined events were held (2). Not only is the sport gaining a bigger audience, but it is also attracting regular people like you and me to take a crack at the crag. Following the pandemic, nearly 100 climbing gyms have opened in North America and profits of El Cap, one of America’s largest operators of climbing facilities, saw a 100% increase in online interactions (2).
One concern with the sport’s booming popularity is the barrier to entry and as a result, there has been a push to make the sport more accessible. In addition, there are only a few databases with outdoor climbing routes that are in access to the public like The Mountain Project or 8A. Without these platforms, climbers who are looking to hit their local crag or boulder may not be able to find routes or know about the quality of them if they do not already have community or word of mouth. While these websites have provided helpful tools to climbers of all experience and skill levels, they are still heavily lacking data and scrapings of these platforms have resulted in DMCA takedowns or lawsuits (3). At the bare minimum, we need better and easier access to climbing data so that data scientists like myself can work to advance the sport for others. As the sport grows so will the influx of data, and with any field that is expanding and rapidly changing, data science can add value to it by making better-informed decisions for multiple stakeholders, generating new insights about its players and audience, and increasing the overall experience for users.
OpenBeta is a non-profit built and run by climbers that enables “open access and innovative uses of climbing data” (1). Though they have also faced several challenges with their attempt to use onX’s data from the Mountain Project with copyright infringement and blocked repositories, according to Outside Learn (3). At the moment their data is public, and Github recently reversed the DMCA takedown thanks to legal efforts from the owner, Viet Nguyen, who is “empowering the community with open license climbing betas and source tools” (1). His goal for OpenBeta is to make climbing data more like an open source project, which in turn would help platforms like Mountain Project to increase their recommendation systems, geolocation data, and the accuracy of submissions (3). In addition to pushing for accessible data, the OpenBeta also posts articles that fit the needs of any climber in STEM: tutorials, current events, and project inspirations like recommendation systems and route quality maps. The community that OpenBeta is fostering aligns heavily with the forward mentality of climbing currently which is: don’t be a gatekeeper, spread the beta, and anyone is capable enough.
As a young climber and data scientist, I found myself incredibly inspired by OpenBeta’s work and wanted to support the nonprofit by using their data and some of their resources for my capstone project. I plan to leverage climbing data to influence decision making for climbers of all skill sets and as a result, contribute to the overarching goal of OpenBeta which is to make the sport of climbing safer, more knowledgeable, and more accessible. Recommendation systems are extremely powerful and if done well, can be a great tool for young climbers when exploring outdoor routes. To get a better understanding of the data and to ensure the viability of this goal, I am performing an exploratory data analysis of the OpenBeta data.
# Setup
Here I am pulling all ratings from the Open Beta which I will then take a subset of to getWest Coast ratings. I wanted to work with West Coast recommendations for a couple reasons. For one, the Sierra Nevada of California and the Cascade Range of the Pacific Northwest are prime western U.S. rock climbing locales. In addition the West Coast is scattered with popular climbing spots (i.e. those in Yosemite National Park) but there is a common misconception that these areas only have expert graded routes. In reality the opposite is true, and there are actually more beginner to moderate routes than expert ones. Therefore with an application like this one, anyone can have access to local classics even in lower grades. This dataset contains all route ratings in US along with route ID, grade, name, and type (trad, sport, ice, bouldering, etc.).
I’m going to look at only trad and sport routes on the West Coast, which are the most two popular types of outdoor climbing. A limit of this method is that we lose routes that could be both sport and trad.
or_ratings <- or_ratings %>%
mutate(trad = ifelse(str_extract(type, "tr") == "tr", 1, 0)) %>%
mutate(sport = ifelse(str_extract(type, "sp") == "sp", 1, 0)) %>%
mutate(trad = ifelse(is.na(trad), 0, trad)) %>%
mutate(sport = ifelse(is.na(sport), 0, sport)) %>%
filter(trad != sport) %>%
mutate(type = ifelse(trad == 1, "trad", "sport")) %>%
select(-trad, -sport)
Next, I’m pulling some aggregate rating data from OpenBeta along with the location of parent walls to use for plotting. The features we will use from this dataset are the parent wall ID, name, and location along with the state and ARQI rating (we will explain this metric later).
or_quality <-
read_csv("or_quality_data.csv")
## New names:
## * `` -> ...1
## Rows: 2767 Columns: 20── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): route_name, type_string, parent_sector, parent_loc, nopm_YDS, safe...
## dbl (13): ...1, route_ID, sector_ID, num_votes, adjusted_num_votes, mean_rat...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
or_quality
## # A tibble: 2,767 × 20
## ...1 route_name route_ID type_string sector_ID parent_sector parent_loc
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 29960 Skelator 107358365 trad 107357989 Castle [-121.0386,…
## 2 29961 As You Wish 107357995 trad 107357989 Castle [-121.0386,…
## 3 29962 African Swa… 107358243 trad 107357989 Castle [-121.0386,…
## 4 29963 Mekaneck 107358348 trad 107357989 Castle [-121.0386,…
## 5 29964 Holy Hand G… 107358251 trad 107357989 Castle [-121.0386,…
## 6 29965 Fezzik 107358339 trad 107357989 Castle [-121.0386,…
## 7 29966 Inconceivab… 107358324 trad 107357989 Castle [-121.0386,…
## 8 29967 Trojan Rabb… 107358465 trad 107357989 Castle [-121.0386,…
## 9 29968 Orko 107358440 trad 107357989 Castle [-121.0386,…
## 10 29969 Big Arch Co… 105842927 trad 105842915 Great Arch, … [-122.14883…
## # … with 2,757 more rows, and 13 more variables: num_votes <dbl>,
## # adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## # mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## # ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## # state <chr>
Now I’m pulling the latitude and longitude from the parent_loc variable into two separate columns for plotly.
orGeo <- or_quality %>%
mutate(lon = as.numeric(str_extract(parent_loc, '(-|)\\d+.\\d+')),
lat = as.numeric(str_extract(parent_loc, '\\s(-|)\\d+.\\d+'))) %>%
filter(!is.na(lat), !is.na(lon)) %>%
mutate(route_ID = as.character(route_ID))
orGeo
## # A tibble: 2,767 × 22
## ...1 route_name route_ID type_string sector_ID parent_sector parent_loc
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 29960 Skelator 107358365 trad 107357989 Castle [-121.0386,…
## 2 29961 As You Wish 107357995 trad 107357989 Castle [-121.0386,…
## 3 29962 African Swa… 107358243 trad 107357989 Castle [-121.0386,…
## 4 29963 Mekaneck 107358348 trad 107357989 Castle [-121.0386,…
## 5 29964 Holy Hand G… 107358251 trad 107357989 Castle [-121.0386,…
## 6 29965 Fezzik 107358339 trad 107357989 Castle [-121.0386,…
## 7 29966 Inconceivab… 107358324 trad 107357989 Castle [-121.0386,…
## 8 29967 Trojan Rabb… 107358465 trad 107357989 Castle [-121.0386,…
## 9 29968 Orko 107358440 trad 107357989 Castle [-121.0386,…
## 10 29969 Big Arch Co… 105842927 trad 105842915 Great Arch, … [-122.14883…
## # … with 2,757 more rows, and 15 more variables: num_votes <dbl>,
## # adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## # mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## # ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## # state <chr>, lon <dbl>, lat <dbl>
As part of the EDA process, we first take a subset of our datasets to only include West Coast information. We used the features from route_quality dataset to join onto our ratings dataset: wc_ratings.
#wcGeo <- rqGeo %>%
#filter(state == "Oregon" | state == "Washington" | state == "California")
or_ratings <- or_ratings %>%
mutate(route_id = as.character(route_id)) %>%
inner_join(orGeo, by = c("route_id" = "route_ID"))
#wcGeo
or_ratings
## # A tibble: 56,024 × 27
## users ratings route_id name grade type ...1 route_name type_string
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 d093d0f400… 1 1181781… Twin … 5.9 trad 30010 Twin Sister trad
## 2 f56f930bec… 3 1192561… Winds… 5.12a sport 70489 Winds of P… sport
## 3 04fd5a96a0… 3 1192561… Winds… 5.12a sport 70489 Winds of P… sport
## 4 6f83adad05… 3 1192561… Winds… 5.12a sport 70489 Winds of P… sport
## 5 54b89a6980… 2 1192561… Winds… 5.12a sport 70489 Winds of P… sport
## 6 f766420d4f… 4 1062665… South… 5.2 trad 96351 South Ridg… trad
## 7 5444a5e886… 4 1062665… South… 5.2 trad 96351 South Ridg… trad
## 8 4b7315ea87… 4 1062665… South… 5.2 trad 96351 South Ridg… trad
## 9 14bf58a87f… 4 1062665… South… 5.2 trad 96351 South Ridg… trad
## 10 5d469f9184… 4 1062665… South… 5.2 trad 96351 South Ridg… trad
## # … with 56,014 more rows, and 18 more variables: sector_ID <dbl>,
## # parent_sector <chr>, parent_loc <chr>, num_votes <dbl>,
## # adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## # mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## # ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## # state <chr>, lon <dbl>, lat <dbl>
We can see that our dataset is dominated by trad routes and observations that come from California. This raises a bias concern with an Item Based Collaborative Filtering recommendation system. It may be worth keeping the California data regardless, as the data is the most complete and probably more accurate.
library(RColorBrewer)
library(ggthemes)
or_ratings %>%
ggplot(aes(x = state, fill = type)) +
geom_histogram(stat = "count") +
scale_fill_brewer(palette = "Set2") +
theme_tufte() +
labs(title = "Sport Route and Trad Route Counts by State\n",
x = "\nState\n", y = "\nCount\n") +
theme(axis.title.x = element_text(size = 10),
axis.title.y = element_text(size = 10))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

We will only take routes with grades that are in the Yosemite Decimal System, which is the traditional difficulty rating for routes in the US.
my_levels = c(str_sort(or_ratings$grade) %>% unique())
or_ratings <- or_ratings %>%
filter(grade %in% c(my_levels[1:2], my_levels[48:56], my_levels[3:47]))
or_ratings$grade <- as_factor(or_ratings$grade)
levels(or_ratings$grade) = c(my_levels[1:2], my_levels[48:56], my_levels[3:47]) # easy to expert
levels(or_ratings$grade)
## [1] "5.0" "5.1" "5.2" "5.3" "5.4" "5.5" "5.6"
## [8] "5.7" "5.7+" "5.8" "5.8-" "5.10" "5.10-" "5.10+"
## [15] "5.10a" "5.10a/b" "5.10b" "5.10b/c" "5.10c" "5.10c/d" "5.10d"
## [22] "5.11" "5.11-" "5.11+" "5.11a" "5.11a/b" "5.11b" "5.11b/c"
## [29] "5.11c" "5.11c/d" "5.11d" "5.12" "5.12-" "5.12+" "5.12a"
## [36] "5.12a/b" "5.12b" "5.12b/c" "5.12c" "5.12c/d" "5.12d" "5.13"
## [43] "5.13-" "5.13+" "5.13a" "5.13a/b" "5.13b" "5.13b/c" "5.13c"
## [50] "5.13c/d" "5.13d" "5.14" "5.14-" "5.14a" "5.14b" "5.14c"
There are actually more beginner to moderate routes than expert ones! According to the Yosemite Decimal System, a 5.0 to 5.7 is considered easy, 5.8 to 5.10 is considered intermediate, 5.11 to 5.12 is hard, and 5.13 to 5.15 is reserved for a very elite few. This means that the app will be of use to any climber, as some local classics come even in lower grades.
or_ratings <- or_ratings %>%
mutate(level = case_when(
grade %in% c(my_levels[1:2], my_levels[51:57]) ~ "easy",
grade %in% c(my_levels[58:63], my_levels[3:12]) ~ "intermediate",
grade %in% c(my_levels[13:32]) ~ "hard",
grade %in% c(my_levels[33:50]) ~ "elite"))
or_ratings$level <- factor(or_ratings$level,
levels = c("easy", "intermediate", "hard", "elite"))
or_ratings %>%
group_by(grade, level) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(perc = count/sum(count)) %>%
ggplot(aes(x = reorder(grade, -perc), y = perc)) +
geom_col(aes(fill = level)) +
scale_fill_brewer(palette = "Set2") +
labs(x = "\nGrade (YDS)", y = "Percent\n", title = "Proportion of Routes in the West Coast By Grade") +
theme_tufte() +
theme(axis.text.x = element_text(angle = 45, size = 5, vjust = 0.5))
## By state
We see that across all three states, a majority of the top proportion of routes per grade are in the easy to intermediate range with a handful from Oregon and Washington being hard.
or_ratings %>%
group_by(state, grade, level) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(perc = count/sum(count)) %>%
ungroup() %>%
group_by(state) %>%
slice_max(order_by = perc, n = 5) %>%
ggplot(aes(x = reorder(grade, -perc), y = perc)) +
geom_col(aes(fill = level)) +
facet_wrap(~state, scales = "free") +
scale_fill_brewer(palette = "Set2") +
labs(x = "\nGrade (YDS)", y = "Percent\n", title = "Top Proportion of Routes per Grade by State") +
theme_tufte() +
theme(axis.text.x = element_text(angle = 45, size = 8, vjust = 0.5))

As a metric for route quality, we can look at the aggregate metric RQI or ARQI precalculated in the data. The RQI is equal to S(1-1/N) where S is the average stars (or median) and N is the number of votes. As N approaches infinity, (1-1/N) approaches 1 and RQI approaches S. One issue with this metric is that harder routes get fewer ascents and therefore less votes, making it difficult for hard routes to make it into the “classic” class. We will use the Adjusted RQI (ARQI), which corrects for bias of RQI towards easier routes by adjusting the number of votes and therefore doesn’t make route quality a “popularity metric.” The ARQI is equal to S(1-1/Nw) where Nw is the number of weighted or adjusted votes and is determined by the votes-per-route for each grade. According to OpenBeta, - Classic: ARQI >= 3.5 - Area Classic: 2.5 <= ARQI < 3.5 - Good: 1.5 <= ARQI < 2.5 - Bad: 0.5 >= ARQI < 1.5 - Bomb: ARQI < 0.5
or_ratings <- or_ratings %>%
distinct() %>%
mutate(class = case_when(
ARQI_median >= 3.5 ~ "classic",
ARQI_median >= 2.5 & ARQI_median < 3.5 ~ "area classic",
ARQI_median >= 1.5 & ARQI_median < 2.5 ~ "good",
ARQI_median >= 0.5 & ARQI_median < 1.5 ~ "bad",
ARQI_median < 0.5 ~ "bomb")) %>%
group_by(parent_sector) %>%
mutate(best_route = route_name[which.max(ARQI_median)]) %>%
ungroup()
or_ratings
## # A tibble: 46,653 × 30
## users ratings route_id name grade type ...1 route_name type_string
## <chr> <dbl> <chr> <chr> <fct> <chr> <dbl> <chr> <chr>
## 1 f56f930bec… 3 1192561… Winds… 5.0 sport 70489 Winds of P… sport
## 2 04fd5a96a0… 3 1192561… Winds… 5.0 sport 70489 Winds of P… sport
## 3 6f83adad05… 3 1192561… Winds… 5.0 sport 70489 Winds of P… sport
## 4 54b89a6980… 2 1192561… Winds… 5.0 sport 70489 Winds of P… sport
## 5 f766420d4f… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## 6 5444a5e886… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## 7 4b7315ea87… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## 8 14bf58a87f… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## 9 5d469f9184… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## 10 968aefb501… 4 1062665… South… 5.1 trad 96351 South Ridg… trad
## # … with 46,643 more rows, and 21 more variables: sector_ID <dbl>,
## # parent_sector <chr>, parent_loc <chr>, num_votes <dbl>,
## # adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## # mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## # ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## # state <chr>, lon <dbl>, lat <dbl>, level <fct>, class <chr>,
## # best_route <chr>
Note that some routes with a lower ARQI may have a higher median rating. The ARQI takes the number of votes into consideration, allowing for a more accurate and fair route designation (we don’t want just any route falling into a classic).
or_ratings$class <- factor(or_ratings$class,
levels = c("classic", "area classic", "good", "bad", "bomb"))
or_ratings %>%
filter(num_votes < 400) %>% #filter outliers
select(num_votes, median_rating, class) %>%
distinct() %>%
ggplot(aes(num_votes, median_rating, group = class)) +
geom_point(aes(color = class), alpha = 0.6, position = position_jitterdodge(jitter.width = .9, jitter.height = 0.1)) +
scale_color_brewer(palette = "YlOrRd") +
theme_tufte() +
labs(x = "\nNumber of Votes\n", y = "\nMedian Rating\n", title = "Median Ratings vs Number of Votes by ARQI Class\n")

We see that there are a majority of routes in the Area Classic range (between 2.5 and 3.5) across all three states with outliers in the Bad class. But what about by grade?
or_ratings %>%
group_by(state) %>%
ggplot(aes(x = state, y = ARQI_median, fill = state)) +
geom_boxplot() +
scale_fill_brewer(palette = "Set3") +
theme_tufte() +
labs(x = "\nState\n", y = "ARQI\n", title = "Distribution of ARQI Scores by State\n")

We find that a majority of easy and intermediate routes are both classic and area classics, supporting my claim that anyone can climb a classic not only at their local crag but additionally famous big walls climbed by the legends.
or_ratings %>%
group_by(level, class) %>%
ggplot(aes(x = level)) +
geom_bar(aes(fill = class), position = "dodge") +
scale_fill_brewer(palette = "Set1") +
theme_tufte() +
labs(x = "\nLevel\n", y = "Number of Routes\n", title = "Distribution of Route Classes by Grade Level")

I plan to combine a recommendation and route ranking system in a map format using geolocation and ratings data. Specifically, I will use Plotly for interactivity, Mapbox for geocoding, and Shiny for construction of the web application. Here I am accessing my public token from Mapbox in order to do some basic plotting and interactivity with Plotly.
library(mapboxapi)
## Warning: package 'mapboxapi' was built under R version 4.1.2
## Usage of the Mapbox APIs is governed by the Mapbox Terms of Service.
## Please visit https://www.mapbox.com/legal/tos/ for more information.
my_token <- 'pk.eyJ1Ijoibmhlcm5hbmRlejE5OTkiLCJhIjoiY2wzZGZjZDEwMDFyajNjbDVxMnJ2M2lwdSJ9.N9R9fzcgvK1ieQ_s5eVwQw'
#mb_access_token(my_token, install = TRUE, overwrite = TRUE)
#readRenviron("~/.Renviron")
Sys.setenv('MAPBOX_PUBLIC_TOKEN' = my_token)
Sys.getenv('MAPBOX_PUBLIC_TOKEN')
## [1] "pk.eyJ1Ijoibmhlcm5hbmRlejE5OTkiLCJhIjoiY2wzZGZjZDEwMDFyajNjbDVxMnJ2M2lwdSJ9.N9R9fzcgvK1ieQ_s5eVwQw"
With some basic plotly commands, we can plot all the West Coast routes. Ideally the user will be able to filter grade range, type, location, and rating on the Shiny app.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
fig <- or_ratings %>%
plot_ly(lat = ~lat,
lon = ~lon,
mode = 'markers',
type = 'scattermapbox',
color = or_ratings$class,
hoverinfo = 'text',
text = paste("Parent wall: ", or_ratings$parent_sector,
"<br>",
"Best Route: ", or_ratings$best_route,
"<br>",
"Type: ", or_ratings$type_string,
"<br>",
"Class: ", or_ratings$class,
"<br>",
"ARQI: ", or_ratings$ARQI_median,
"<br>",
"Grade: ", or_ratings$grade)
) %>%
layout(
mapbox = list(
style = 'open-street-map', # or 'light'
zoom = 5,
center = list(lon = -120, lat = 44)
)
) %>%
config(mapboxAccessToken = Sys.getenv("MAPBOX_PUBLIC_TOKEN"))
fig